{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NSS Tutorial\n",
    "\n",
    "Field\t| Type\n",
    "--------|-------\n",
    "ukprn\t| varchar(8)\n",
    "institution\t| varchar(100)\n",
    "subject\t| varchar(60)\n",
    "level\t| varchar(50)\n",
    "question\t| varchar(10)\n",
    "A_STRONGLY_DISAGREE\t| int(11)\n",
    "A_DISAGREE\t| int(11)\n",
    "A_NEUTRAL\t| int(11)\n",
    "A_AGREE\t| int(11)\n",
    "A_STRONGLY_AGREE\t| int(11)\n",
    "A_NA\t| int(11)\n",
    "CI_MIN\t| int(11)\n",
    "score\t| int(11)\n",
    "CI_MAX\t| int(11)\n",
    "response\t| int(11)\n",
    "sample\t| int(11)\n",
    "aggregate\t| char(1)\n",
    "\n",
    "National Student Survey 2012\n",
    "\n",
    "The National Student Survey <http://www.thestudentsurvey.com/> is presented to thousands of graduating students in UK Higher Education. The survey asks 22 questions, students can respond with STRONGLY DISAGREE, DISAGREE, NEUTRAL, AGREE or STRONGLY AGREE. The values in these columns represent PERCENTAGES of the total students who responded with that answer.\n",
    "\n",
    "The table `nss` has one row per institution, subject and question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n"
     ]
    }
   ],
   "source": [
    "import findspark\n",
    "import pandas as pd\n",
    "findspark.init()\n",
    "\n",
    "SVR = '192.168.31.31'\n",
    "from pyspark.sql import SparkSession\n",
    "\n",
    "sc = (SparkSession.builder.appName('app08+') \n",
    "      .master(f'spark://{SVR}:7077') \n",
    "      .config('spark.sql.warehouse.dir', f'hdfs://{SVR}:9000/user/hive/warehouse') \n",
    "      .config('spark.cores.max', '4') \n",
    "      .config('spark.executor.instances', '1') \n",
    "      .config('spark.executor.cores', '2') \n",
    "      .config('spark.executor.memory', '10g') \n",
    "      .enableHiveSupport().getOrCreate())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Check out one row\n",
    "\n",
    "The example shows the number who responded for:\n",
    "\n",
    "- question 1\n",
    "- at 'Edinburgh Napier University'\n",
    "- studying '(8) Computer Science'\n",
    "\n",
    "**Show the the percentage who STRONGLY AGREE**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "nss = sc.read.table('sqlzoo.nss')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A_STRONGLY_AGREE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   A_STRONGLY_AGREE\n",
       "0                23"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(nss.filter((nss['question']=='Q01') & \n",
    "            (nss['institution']=='Edinburgh Napier University') & \n",
    "            (nss['subject']=='(8) Computer Science'))\n",
    " .select('A_STRONGLY_AGREE')\n",
    " .toPandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Calculate how many agree or strongly agree\n",
    "\n",
    "**Show the institution and subject where the score is at least 100 for question 15.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>institution</th>\n",
       "      <th>subject</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Kingston College</td>\n",
       "      <td>(I) Education</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Royal Holloway, University of London</td>\n",
       "      <td>(L) Geographical Studies</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Solihull College</td>\n",
       "      <td>(I) Education</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Stafford College</td>\n",
       "      <td>(D) Business and Administrative studies</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>University of Southampton</td>\n",
       "      <td>(E) Mass Communications and Documentation</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>University of Wolverhampton</td>\n",
       "      <td>(7) Mathematical Sciences</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>University of Leicester</td>\n",
       "      <td>(2) Subjects allied to Medicine</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>University of Newcastle upon Tyne</td>\n",
       "      <td>(E) Mass Communications and Documentation</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Bishop Grosseteste University College, Lincoln</td>\n",
       "      <td>(F) Languages</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Universities of East Anglia and Essex; Joint P...</td>\n",
       "      <td>(G) Historical and Philosophical studies</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                         institution   \n",
       "0                                   Kingston College  \\\n",
       "1               Royal Holloway, University of London   \n",
       "2                                   Solihull College   \n",
       "3                                   Stafford College   \n",
       "4                          University of Southampton   \n",
       "5                        University of Wolverhampton   \n",
       "6                            University of Leicester   \n",
       "7                  University of Newcastle upon Tyne   \n",
       "8     Bishop Grosseteste University College, Lincoln   \n",
       "9  Universities of East Anglia and Essex; Joint P...   \n",
       "\n",
       "                                     subject  \n",
       "0                              (I) Education  \n",
       "1                   (L) Geographical Studies  \n",
       "2                              (I) Education  \n",
       "3    (D) Business and Administrative studies  \n",
       "4  (E) Mass Communications and Documentation  \n",
       "5                  (7) Mathematical Sciences  \n",
       "6            (2) Subjects allied to Medicine  \n",
       "7  (E) Mass Communications and Documentation  \n",
       "8                              (F) Languages  \n",
       "9   (G) Historical and Philosophical studies  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(nss.filter((nss['question']=='Q15') & (nss['score']>=100))\n",
    " .select('institution', 'subject')\n",
    " .toPandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Unhappy Computer Students\n",
    "\n",
    "**Show the institution and score where the score for '(8) Computer Science' is less than 50 for question 'Q15'**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>institution</th>\n",
       "      <th>score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Blackburn College</td>\n",
       "      <td>45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>North Lindsey College</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Plymouth College of Art</td>\n",
       "      <td>47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Somerset College of Arts and Technology</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>University of Wales, Newport</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Universities of East Anglia and Essex; Joint P...</td>\n",
       "      <td>45</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                         institution  score\n",
       "0                                  Blackburn College     45\n",
       "1                              North Lindsey College     30\n",
       "2                            Plymouth College of Art     47\n",
       "3            Somerset College of Arts and Technology     48\n",
       "4                       University of Wales, Newport     30\n",
       "5  Universities of East Anglia and Essex; Joint P...     45"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(nss.filter((nss['question']=='Q15') & \n",
    "           (nss['subject']=='(8) Computer Science') & \n",
    "           (nss['score']<50))\n",
    " .select('institution', 'score')\n",
    " .toPandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. More Computing or Creative Students?\n",
    "\n",
    "**Show the subject and total number of students who responded to question 22 for each of the subjects '(8) Computer Science' and '(H) Creative Arts and Design'.**\n",
    "\n",
    "> _HINT_    \n",
    "> You will need to use SUM over the response column and GROUP BY subject"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>subject</th>\n",
       "      <th>sum(response)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>(8) Computer Science</td>\n",
       "      <td>10252</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>(H) Creative Arts and Design</td>\n",
       "      <td>33336</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        subject  sum(response)\n",
       "0          (8) Computer Science          10252\n",
       "1  (H) Creative Arts and Design          33336"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(nss.filter((nss['question']=='Q22') & \n",
    "            (nss['subject'].isin(\n",
    "                ['(8) Computer Science', '(H) Creative Arts and Design'])))\n",
    " .groupBy('subject')\n",
    " .agg({'response': 'sum'})\n",
    " .toPandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Strongly Agree Numbers\n",
    "\n",
    "**Show the subject and total number of students who A_STRONGLY_AGREE to question 22 for each of the subjects '(8) Computer Science' and '(H) Creative Arts and Design'.**\n",
    "\n",
    "> _HINT_    \n",
    "> The A_STRONGLY_AGREE column is a percentage. To work out the total number of students who strongly agree you must multiply this percentage by the number who responded (response) and divide by 100 - take the SUM of that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>subject</th>\n",
       "      <th>sum(n_strongly_agree)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>(8) Computer Science</td>\n",
       "      <td>3355</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>(H) Creative Arts and Design</td>\n",
       "      <td>11992</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        subject  sum(n_strongly_agree)\n",
       "0          (8) Computer Science                   3355\n",
       "1  (H) Creative Arts and Design                  11992"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from pyspark.sql.functions import *\n",
    "(nss.withColumn('n_strongly_agree', nss['response']*nss['A_STRONGLY_AGREE']/100)\n",
    "     .filter((nss['question']=='Q22') &\n",
    "             (nss['subject'].isin(\n",
    "                 ['(8) Computer Science', '(H) Creative Arts and Design'])))\n",
    "    .select('subject', col('n_strongly_agree').cast('int'))\n",
    "    .groupBy('subject')\n",
    "    .sum()\n",
    "    .toPandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Strongly Agree, Percentage\n",
    "\n",
    "**Show the percentage of students who A_STRONGLY_AGREE to question 22 for the subject '(8) Computer Science' show the same figure for the subject '(H) Creative Arts and Design'.**\n",
    "\n",
    "Use the **ROUND** function to show the percentage without decimal places."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>subject</th>\n",
       "      <th>pct</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>(8) Computer Science</td>\n",
       "      <td>33.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>(H) Creative Arts and Design</td>\n",
       "      <td>36.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        subject   pct\n",
       "0          (8) Computer Science  33.0\n",
       "1  (H) Creative Arts and Design  36.0"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(nss.withColumn('n_sa', nss['A_STRONGLY_AGREE']*nss['response'])\n",
    "    .filter((nss['question']=='Q22') &\n",
    "            (nss['subject'].isin(\n",
    "                ['(8) Computer Science', '(H) Creative Arts and Design'])))\n",
    "    .select('subject', 'n_sa', 'response')\n",
    "    .groupBy('subject')\n",
    "    .sum()\n",
    "    .withColumn('pct', round(col('sum(n_sa)')/col('sum(response)'), 0))\n",
    "    .select('subject', 'pct')\n",
    "    .toPandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Scores for Institutions in Manchester\n",
    "\n",
    "**Show the average scores for question 'Q22' for each institution that include 'Manchester' in the name.**\n",
    "\n",
    "The column **score** is a percentage - you must use the method outlined above to multiply the percentage by the **response** and divide by the total response. Give your answer rounded to the nearest whole number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>institution</th>\n",
       "      <th>score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Manchester Metropolitan University</td>\n",
       "      <td>81.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>University of Manchester</td>\n",
       "      <td>83.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>The Manchester College</td>\n",
       "      <td>72.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          institution  score\n",
       "0  Manchester Metropolitan University   81.0\n",
       "1            University of Manchester   83.0\n",
       "2              The Manchester College   72.0"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(nss.withColumn('score', nss['response']*nss['score'])\n",
    "    .filter((nss['question']=='Q22') & (nss['institution'].contains('Manchester')))\n",
    "    .select('institution', 'score', 'response')\n",
    "    .groupBy('institution')\n",
    "    .sum()\n",
    "    .withColumn('score', round(col('sum(score)')/col('sum(response)'), 0))\n",
    "    .select('institution', 'score')\n",
    "    .toPandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8.Number of Computing Students in Manchester\n",
    "\n",
    "**Show the institution, the total sample size and the number of computing students for institutions in Manchester for 'Q01'.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>institution</th>\n",
       "      <th>sum(sample)</th>\n",
       "      <th>sum(comp)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Manchester Metropolitan University</td>\n",
       "      <td>6994</td>\n",
       "      <td>310</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>University of Manchester</td>\n",
       "      <td>8065</td>\n",
       "      <td>180</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>The Manchester College</td>\n",
       "      <td>537</td>\n",
       "      <td>46</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          institution  sum(sample)  sum(comp)\n",
       "0  Manchester Metropolitan University         6994        310\n",
       "1            University of Manchester         8065        180\n",
       "2              The Manchester College          537         46"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(nss.filter((nss['question']=='Q01') & \n",
    "            (nss['institution'].contains('Manchester')))\n",
    " .select('institution', 'sample', 'subject',\n",
    "        when(nss['subject']=='(8) Computer Science', nss['sample'])\n",
    "        .otherwise(0).alias('comp'))\n",
    " .groupBy('institution')\n",
    " .sum()\n",
    " .toPandas())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.stop()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
